
    3D audio-visual speaker tracking with an adaptive particle filter

    We propose an audio-visual fusion algorithm for 3D speaker tracking from a co-located multi-modal sensor platform composed of a camera and a small microphone array. After extracting audio-visual cues from the individual modalities, we fuse them adaptively, according to their reliability, in a particle filter framework. The reliability of the audio signal is measured from the maximum peak value of the Global Coherence Field (GCF) at each frame; the visual reliability is based on colour-histogram matching of detection results against a reference image in RGB space. Experiments on the AV16.3 dataset show that the proposed adaptive audio-visual tracker outperforms both the individual modalities and a classical approach with fixed parameters in terms of tracking accuracy.
    Qian, Xinyuan; Brutti, Alessio; Omologo, Maurizio; Cavallaro, Andrea
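
    The abstract names the two reliability cues but not the exact fusion rule. A minimal sketch of one plausible reliability-weighted particle update follows; the geometric-mean rule and the normalisation references gcf_ref and hist_ref are assumptions for illustration, not the authors' formulation.

    ```python
    import numpy as np

    def fuse_likelihoods(audio_lik, video_lik, gcf_peak, hist_sim,
                         gcf_ref=1.0, hist_ref=1.0):
        """Reliability-weighted fusion of per-particle likelihoods (sketch).

        gcf_peak: max GCF peak of the current frame (audio reliability cue).
        hist_sim: colour-histogram similarity of the detection (visual cue).
        gcf_ref, hist_ref: hypothetical normalisation constants.
        """
        alpha = np.clip(gcf_peak / gcf_ref, 0.0, 1.0)  # audio reliability
        beta = np.clip(hist_sim / hist_ref, 0.0, 1.0)  # visual reliability
        w = alpha / (alpha + beta + 1e-12)             # relative audio weight
        return audio_lik ** w * video_lik ** (1.0 - w)

    # Usage on a set of 100 particles:
    audio_lik = np.random.rand(100)
    video_lik = np.random.rand(100)
    weights = fuse_likelihoods(audio_lik, video_lik, gcf_peak=0.7, hist_sim=0.9)
    weights /= weights.sum()  # normalised particle weights
    ```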

    Audio-visual tracking of concurrent speakers

    Audio-visual tracking of an unknown number of concurrent speakers in 3D is a challenging task, especially when sound and video are collected with a compact sensing platform. In this paper, we propose a tracker that builds on generative and discriminative audio-visual likelihood models formulated in a particle filtering framework. We localize multiple concurrent speakers with a de-emphasized acoustic map, assisted by 3D video observations derived from image detections. The 3D multimodal observations are either assigned to existing tracks for discriminative likelihood computation or used to initialize new tracks. The generative likelihoods rely on the color distribution of the target and on the de-emphasized acoustic map value. Experiments on the AV16.3 and CAV3D datasets show that the proposed tracker outperforms the uni-modal trackers and the state-of-the-art approaches, both in 3D and on the image plane.
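
    The "de-emphasis" step attenuates the acoustic map around speakers already localised so that weaker concurrent speakers emerge as peaks. A toy sketch of that idea follows; the fixed suppression radius is a simplification (in the paper the step is assisted by the video observations).

    ```python
    import numpy as np

    def deemphasized_peaks(acoustic_map, num_speakers, radius=2):
        """Iteratively pick peaks from a 2-D acoustic map, zeroing the
        neighbourhood of each detected peak (toy de-emphasis sketch)."""
        m = acoustic_map.copy()
        peaks = []
        for _ in range(num_speakers):
            r, c = np.unravel_index(np.argmax(m), m.shape)
            peaks.append((int(r), int(c)))
            m[max(r - radius, 0):r + radius + 1,
              max(c - radius, 0):c + radius + 1] = 0.0  # suppress this speaker
        return peaks

    # Two synthetic speakers in a toy map:
    m = np.zeros((50, 50))
    m[10, 12] = 1.0   # dominant speaker
    m[30, 40] = 0.6   # weaker concurrent speaker
    print(deemphasized_peaks(m, num_speakers=2))  # [(10, 12), (30, 40)]
    ```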

    Multi-speaker tracking from an audio-visual sensing device

    Compact multi-sensor platforms are portable and thus desirable for robotics and personal-assistance tasks. However, compared to physically distributed sensors, the size of these platforms makes person tracking more difficult. To address this challenge, we propose a novel 3D audio-visual people tracker that exploits visual observations (object detections) to guide the acoustic processing by constraining the acoustic likelihood to the horizontal plane defined by the predicted height of a speaker. This solution allows the tracker to estimate, with a small microphone array, the distance of a sound source. Moreover, we apply a color-based visual likelihood on the image plane to compensate for misdetections. Finally, we use a 3D particle filter and greedy data association to combine the visual observations with the color-based and acoustic likelihoods and track the position of multiple simultaneous speakers. We compare the proposed multimodal 3D tracker against two state-of-the-art methods on the AV16.3 dataset and on a newly collected dataset with co-located sensors, which we make available to the research community. Experimental results show that our multimodal approach outperforms the other methods both in 3D and on the image plane.
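
    As an illustration of the greedy data-association step, the toy routine below pairs 3D observations with existing tracks in order of increasing Euclidean distance and lets unmatched observations start new tracks; the gating threshold is a hypothetical parameter.

    ```python
    import numpy as np

    def greedy_associate(tracks, observations, gate=0.5):
        """Greedy nearest-neighbour association (sketch). Pairs are taken in
        order of increasing distance; observations farther than `gate` metres
        from every track are returned as candidates for new tracks."""
        pairs = sorted(((np.linalg.norm(t - o), i, j)
                        for i, t in enumerate(tracks)
                        for j, o in enumerate(observations)),
                       key=lambda p: p[0])
        used_t, used_o, assoc = set(), set(), {}
        for d, i, j in pairs:
            if d > gate or i in used_t or j in used_o:
                continue
            assoc[i] = j
            used_t.add(i)
            used_o.add(j)
        new_tracks = [j for j in range(len(observations)) if j not in used_o]
        return assoc, new_tracks

    tracks = [np.array([0.0, 0.0, 1.6]), np.array([2.0, 1.0, 1.7])]
    obs = [np.array([0.1, 0.0, 1.6]), np.array([5.0, 5.0, 1.5])]
    print(greedy_associate(tracks, obs))  # ({0: 0}, [1]) -> obs 1 starts a track
    ```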

    Robust F0 estimation based on a multi-microphone periodicity function for distant-talking speech

    This work addresses the problem of deriving F0 from distant-talking speech signals acquired by a microphone network. The proposed method exploits the redundancy across the channels by jointly processing the different signals. To this purpose, a multi-microphone periodicity function is derived from the magnitude spectra of all the channels. This function allows F0 to be estimated reliably, even under reverberant conditions, without the need for any post-processing or smoothing technique. Experiments conducted on real data show that the proposed frequency-domain algorithm is more suitable than time-domain alternatives.
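
    A minimal sketch of the joint-processing idea: sum the magnitude spectra of all channels and score candidate F0s with a harmonic-summation periodicity function. The paper's periodicity function differs in detail; the harmonic count and the 1 Hz search grid below are assumptions.

    ```python
    import numpy as np

    def multi_mic_f0(frames, fs, f0_range=(60.0, 400.0), n_harm=5, n_fft=4096):
        """Estimate F0 from synchronous frames of several microphones by
        harmonic summation over the channel-summed magnitude spectrum."""
        spec = sum(np.abs(np.fft.rfft(ch, n_fft)) for ch in frames)
        freqs = np.fft.rfftfreq(n_fft, 1.0 / fs)
        candidates = np.arange(f0_range[0], f0_range[1], 1.0)

        def periodicity(f0):
            bins = [np.argmin(np.abs(freqs - k * f0))
                    for k in range(1, n_harm + 1)]
            return spec[bins].sum()

        scores = [periodicity(f0) for f0 in candidates]
        return candidates[int(np.argmax(scores))]

    # Toy test: three channels of a 150 Hz harmonic signal plus noise
    fs = 16000
    t = np.arange(2048) / fs
    sig = sum(np.sin(2 * np.pi * 150 * k * t) / k for k in range(1, 4))
    frames = [sig + 0.1 * np.random.randn(t.size) for _ in range(3)]
    print(multi_mic_f0(frames, fs))  # ~150.0
    ```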

    Time-frequency reassigned features for automatic chord recognition

    This paper addresses feature extraction for automatic chord recognition systems. Most chord recognition systems use chroma features as a front-end and some kind of classifier (HMM, SVM or template matching). The vast majority of feature extraction approaches are based on mapping frequency bins from the spectrum or constant-Q spectrum to chroma bins. In this work, a set of new chroma features based on the time-frequency reassignment (TFR) technique is investigated. The proposed feature set was evaluated on the commonly used Beatles dataset and proved effective for the chord recognition task, outperforming standard chroma features.
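
    For reference, the standard bin-mapping chroma front-end that TFR-based chroma improves on can be sketched as below; TFR would reassign each bin's energy to its instantaneous frequency and time before this folding step.

    ```python
    import numpy as np

    def chroma_from_spectrum(mag, freqs, f_ref=440.0):
        """Fold a magnitude spectrum into a 12-bin chroma vector by mapping
        each frequency bin to the nearest pitch class (baseline sketch)."""
        chroma = np.zeros(12)
        valid = freqs > 20.0  # skip DC and sub-audio bins
        midi = 69 + 12 * np.log2(freqs[valid] / f_ref)
        pitch_class = np.round(midi).astype(int) % 12
        np.add.at(chroma, pitch_class, mag[valid])
        return chroma / (chroma.max() + 1e-12)

    fs, n_fft = 22050, 4096
    t = np.arange(n_fft) / fs
    x = np.sin(2 * np.pi * 261.63 * t)  # C4
    mag = np.abs(np.fft.rfft(x * np.hanning(n_fft)))
    freqs = np.fft.rfftfreq(n_fft, 1.0 / fs)
    print(np.argmax(chroma_from_spectrum(mag, freqs)))  # 0 == pitch class C
    ```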

    Exploiting inter-microphone agreement for hypothesis combination in distant speech recognition

    A multi-microphone hypothesis combination approach, suitable for the distant-talking scenario, is presented in this paper. The method is based on the inter-microphone agreement of information extracted at the speech recognition level. In particular, temporal information is exploited to organize the clusters that shape the resulting confusion network and to reduce the global hypothesis search space. As a result, a single combined confusion network is generated from multiple lattices. The approach offers a novel perspective on solutions based on confusion network combination. The method was evaluated in a simulated domestic environment equipped with widely spaced microphones. The experimental evidence suggests that results comparable to, and in some cases better than, the state of the art can be achieved with the proposed method under optimal configurations.
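
    The temporal-clustering idea can be illustrated by grouping time-stamped word hypotheses from several recognizers into confusion-network slots by interval overlap. The gap threshold and the per-slot scoring below are simplifications, not the paper's exact algorithm.

    ```python
    def cluster_by_time(hyps, max_gap=0.05):
        """Group (word, start, end, score) hypotheses into time-ordered slots
        and keep the best-scoring word per slot (toy sketch)."""
        hyps = sorted(hyps, key=lambda h: h[1])  # sort by start time
        clusters, cur, cur_end = [], [], None
        for word, start, end, score in hyps:
            if cur and start > cur_end + max_gap:  # gap -> close current slot
                clusters.append(cur)
                cur, cur_end = [], None
            cur.append((word, score))
            cur_end = end if cur_end is None else max(cur_end, end)
        if cur:
            clusters.append(cur)
        return [max(slot, key=lambda ws: ws[1])[0] for slot in clusters]

    # Hypotheses from two microphones' recognizers:
    hyps = [("the", 0.00, 0.12, 0.9), ("a", 0.01, 0.11, 0.4),
            ("cat", 0.20, 0.45, 0.8), ("hat", 0.21, 0.46, 0.7)]
    print(cluster_by_time(hyps))  # ['the', 'cat']
    ```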

    Generalized State Coherence Transform for multidimensional localization of multiple sources

    In our recent work, an effective method for multiple-source localization was proposed under the name of cumulative State Coherence Transform (cSCT). Exploiting the physical meaning of frequency-domain blind source separation and the sparse time-frequency dominance of acoustic sources, multiple reliable TDOAs can be estimated with only two microphones, regardless of the permutation problem and of the microphone spacing. In this paper, we present a multidimensional generalization of the cSCT which allows one to localize several sources in P-dimensional space. An important approximation is made in order to perform disjoint TDOA estimation over each dimension, which reduces the localization problem to linear complexity. Furthermore, the approach is invariant to the array geometry and to the assumed acoustic propagation model. Experimental results on simulated data show precise 2D localization of seven sources using an array of only three elements.
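
    The sparse time-frequency dominance assumption can be illustrated with a toy two-microphone example: each time-frequency bin votes for the delay implied by its inter-channel phase, and with disjoint sources the vote histogram peaks once per source. This is a sketch of the underlying idea only, not the SCT itself.

    ```python
    import numpy as np

    def tdoa_histogram(x1, x2, fs, n_fft=1024, max_tau=1e-3, n_bins=81):
        """Accumulate per-TF-bin delay estimates into a histogram (sketch)."""
        hop, win = n_fft // 2, np.hanning(n_fft)
        taus = np.linspace(-max_tau, max_tau, n_bins)
        hist = np.zeros(n_bins)
        freqs = np.fft.rfftfreq(n_fft, 1.0 / fs)[1:]
        for i in range(0, len(x1) - n_fft, hop):
            X1 = np.fft.rfft(win * x1[i:i + n_fft])
            X2 = np.fft.rfft(win * x2[i:i + n_fft])
            phase = np.angle(X1[1:] * np.conj(X2[1:]))
            tau = phase / (2 * np.pi * freqs)  # delay voted by each TF bin
            ok = np.abs(tau) <= max_tau
            idx = np.clip(np.digitize(tau[ok], taus) - 1, 0, n_bins - 1)
            np.add.at(hist, idx, np.abs(X1[1:][ok]))  # energy-weighted votes
        return taus, hist

    fs, delay = 16000, 4  # 4-sample delay = 2.5e-4 s
    s = np.random.randn(fs)
    x1, x2 = s[delay:], s[:-delay]
    taus, hist = tdoa_histogram(x1, x2, fs)
    print(taus[np.argmax(hist)])  # ~2.5e-4
    ```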

    Talker localization and speech enhancement in a noisy environment using a microphone array based acquisition system

    This paper deals with the use of linear microphone arrays for the detection, localization and enhancement of a generic acoustic message produced in a noisy environment. A Crosspower-Spectrum Phase (CSP) based analysis and a Coherence Measure representation are presented, which allow the accurate time delay estimation employed to form the acoustic source position hypothesis. Preliminary results in terms of source localization accuracy are given. Once the source position is estimated, an enhanced version of the original acoustic message is derived, which can serve as input to a speech recognition system.
    Keywords: Microphone Arrays, Talker Localization, Speech Enhancement.
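
    The Crosspower-Spectrum Phase analysis named here is essentially the technique now widely known as GCC-PHAT: whiten the cross-spectrum of two microphone signals and take the lag of the inverse-transform peak. A minimal sketch:

    ```python
    import numpy as np

    def csp_delay(x1, x2, fs):
        """Crosspower-Spectrum Phase (GCC-PHAT) time-delay estimate (sketch)."""
        n = len(x1) + len(x2)
        X1, X2 = np.fft.rfft(x1, n), np.fft.rfft(x2, n)
        cross = X1 * np.conj(X2)
        cc = np.fft.irfft(cross / (np.abs(cross) + 1e-12), n)
        cc = np.concatenate((cc[-(n // 2):], cc[:n // 2]))  # centre lag 0
        return (np.argmax(cc) - n // 2) / fs

    fs, d = 16000, 12  # true inter-microphone delay in samples
    s = np.random.randn(fs // 4)
    x1, x2 = s[:-d], s[d:]
    print(round(csp_delay(x1, x2, fs) * fs))  # ~12
    ```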